Kaggle | 결측치의 처리

python
titanic
impute
Author

강신성

Published

2023-10-22

Titanic 데이터에는 결측치가 상당히 많았는데, 그것을 처리해서 분석해보자.

1. 라이브러리 imports

#!pip install missingno

import pandas as pd
import numpy as np
import sklearn.impute
import sklearn.linear_model
import matplotlib.pyplot as plt
import missingno as msno

2. 데이터 불러오기

#!kaggle competitions download -c titanic
#!unzip titanic.zip -d ./titanic
#df_train = pd.read_csv('titanic/train.csv')
#df_test = pd.read_csv('titanic/test.csv')
#!rm titanic.zip
#!rm -rf titanic/

## 리눅스 서버가 구축되어 있다면 데이터를 바로 불러오기가 편리하다.
df_test = pd.read_csv('./data/test.csv')
df_train = pd.read_csv('./data/train.csv')

3. 결측치 확인 및 처리

결측치 확인

df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

시각화

msno.matrix(df_train)

결측치 처리

수치형은 수치형끼리, 범주형은 범주형끼리 처리하자.

df_imputed = df_train.copy()

train_num = df_train.select_dtypes(include = 'number')
train_obj = df_train.select_dtypes(exclude = 'number')

df_imputed[train_num.columns] = sklearn.impute.SimpleImputer(strategy = 'mean').fit_transform(train_num)
df_imputed[train_obj.columns] = sklearn.impute.SimpleImputer(strategy = 'most_frequent')

df_imputed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    float64
 1   Survived     891 non-null    float64
 2   Pclass       891 non-null    float64
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    float64
 7   Parch        891 non-null    float64
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        891 non-null    object 
 11  Embarked     891 non-null    object 
dtypes: float64(7), object(5)
memory usage: 83.7+ KB

결측치가 완전히 메꿔진 것을 확인할 수 있다.

4. 분석(?)

늘 해왔던 것처럼…

set(df_train.columns) - set(df_test.columns)
{'Survived'}

Survived : 반응변수

- 근데 몇 번 결측치 처리를 반복해야 하므로 위에서의 과정을 함수로 만들어버리자.

def impute_missing(df):
    """
    imputing missing and output whole dataframe
    
    df : DataFrame include NaN value
    """
    df_ = df.copy()  ## 데이터를 복사
    
    df_num = df_.select_dtypes(include = 'number')  ## 해당하는 데이터 타입만 선택
    df_obj = df_.select_dtypes(exclude = 'number')
    
    df_[df_num.columns] = sklearn.impute.SimpleImputer(strategy = 'mean').fit_transform(df_num)
    df_[df_obj.columns] = sklearn.impute.SimpleImputer(strategy = 'most_frequent').fit_transform(df_obj)
    
    return df_
pd.get_dummies(impute_missing(df_train.drop(['Survived'], axis = 1)))
PassengerId Pclass Age SibSp Parch Fare Name_Abbing, Mr. Anthony Name_Abbott, Mr. Rossmore Edward Name_Abbott, Mrs. Stanton (Rosa Hunt) Name_Abelson, Mr. Samuel ... Cabin_F G73 Cabin_F2 Cabin_F33 Cabin_F38 Cabin_F4 Cabin_G6 Cabin_T Embarked_C Embarked_Q Embarked_S
0 1.0 3.0 22.000000 1.0 0.0 7.2500 False False False False ... False False False False False False False False False True
1 2.0 1.0 38.000000 1.0 0.0 71.2833 False False False False ... False False False False False False False True False False
2 3.0 3.0 26.000000 0.0 0.0 7.9250 False False False False ... False False False False False False False False False True
3 4.0 1.0 35.000000 1.0 0.0 53.1000 False False False False ... False False False False False False False False False True
4 5.0 3.0 35.000000 0.0 0.0 8.0500 False False False False ... False False False False False False False False False True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 887.0 2.0 27.000000 0.0 0.0 13.0000 False False False False ... False False False False False False False False False True
887 888.0 1.0 19.000000 0.0 0.0 30.0000 False False False False ... False False False False False False False False False True
888 889.0 3.0 29.699118 1.0 2.0 23.4500 False False False False ... False False False False False False False False False True
889 890.0 1.0 26.000000 0.0 0.0 30.0000 False False False False ... False False False False False False False True False False
890 891.0 3.0 32.000000 0.0 0.0 7.7500 False False False False ... False False False False False False False False True False

891 rows × 1730 columns

늘 그랬던 것처럼 get_dummies를 해줬는데… 뭔가 이상하다.

pd.get_dummies(impute_missing(df_train.drop(['Survived'], axis = 1))).shape
(891, 1730)

행이 1730개??? > 이러한 상황에서는 선형 모델이 제대로 작동하지 않는다…!

# step 1
X = pd.get_dummies(impute_missing(df_train.drop(['Survived'], axis = 1)))
y = df_train.Survived
XX = pd.get_dummies(impute_missing(df_test))

# step 2
predictr = sklearn.linear_model.LogisticRegression()

# step 3
predictr.fit(X, y)

# step 4
predictr.predict(XX)
C:\Users\hollyriver\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- Cabin_A11
- Cabin_A18
- Cabin_A21
- Cabin_A29
- Cabin_A9
- ...
Feature names seen at fit time, yet now missing:
- Cabin_A10
- Cabin_A14
- Cabin_A16
- Cabin_A19
- Cabin_A20
- ...
{c:len(set(df_train[c])) for c in df_train.select_dtypes(include="object").columns}
{'Name': 891, 'Sex': 2, 'Ticket': 681, 'Cabin': 148, 'Embarked': 4}
  • 형식이 object인 것들이 가지고 있는 유니크한 값들이 몇개인지를 딕셔너리 컴프리헨션 해봤다.

사실상 해당 수가 엄청나게 많은 Name, Ticket, Cabin의 경우 없애는 편이 더 좋아보인다.

5. 진짜 분석

## step 1
X = pd.get_dummies(impute_missing(df_train).drop(['Name', 'Ticket', 'Cabin', 'Survived'], axis = 1))
y = df_train.Survived
XX = pd.get_dummies(impute_missing(df_test).drop(['Name', 'Ticket', 'Cabin'], axis = 1))

## step 2
predictr = sklearn.linear_model.LogisticRegression()

## step 3
predictr.fit(X, y)

## step 4
df_test[['PassengerId']].assign(Survived = predictr.predict(XX))
C:\Users\hollyriver\anaconda3\envs\py\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
PassengerId Survived
0 892 0
1 893 1
2 894 0
3 895 0
4 896 1
... ... ...
413 1305 0
414 1306 1
415 1307 0
416 1308 0
417 1309 0

418 rows × 2 columns

df_test[['PassengerId']].assign(Survived = predictr.predict(XX)).to_csv("submission", index = False)

이렇게 하면 된다.